by Sherwin Chen
Introduction

Entropy regularization has previously been shown to improve both exploration and robustness in changing sequential decision-making tasks. It does so by encouraging policies to spread probability mass across all actions equally. However, entropy regularization may be undesirable when actions differ significantly in importance. For example, some actions may be useless in certain tasks, and a uniform action distribution would then induce fruitless exploration. Grau-Moya et al. (2019) propose a novel regularization that dynamically weights the importance of actions (i.e., adjusts the action prior) using mutual information.

Derivation of Mutual-Information Regularization

In the previous post, we framed the reinforcement learning problem as an inference problem and obtained the soft optimality objective as follows

$$\max_{\pi_0,\pi}J(\pi_0,\pi)=\sum_{t=1}^T\gamma^{t-1}\mathbb E_{s_t,a_t\sim\pi}\Big[r(s_t,a_t)-\tau\log\frac{\pi(a_t|s_t)}{\pi_0(a_t)}\Big]\tag{1}$$

where $\tau$ is the temperature, $\pi(a_t|s_t)$ is the action distribution of the policy being optimized, and the action prior $\pi_0(a_t)$ is restored since we no longer assume it is uniform here. Moving the expectation inward, we obtain

$$
\begin{aligned}
\max_{\pi_0,\pi}J(\pi_0,\pi)&=\sum_{t=1}^T\gamma^{t-1}\Big\{\mathbb E_{s_t,a_t\sim\pi}\big[r(s_t,a_t)\big]-\tau\,\mathbb E_{s_t,a_t\sim\pi}\Big[\log\frac{\pi(a_t|s_t)}{\pi_0(a_t)}\Big]\Big\}\\
&=\sum_{t=1}^T\gamma^{t-1}\Big\{\mathbb E_{s_t,a_t\sim\pi}\big[r(s_t,a_t)\big]-\tau\sum_{s_t}\pi(s_t)D_{KL}\big(\pi(a|s_t)\,\Vert\,\pi_0(a)\big)\Big\}\\
&=\sum_{t=1}^T\gamma^{t-1}\mathbb E_{s_t,a_t\sim\pi}\big[r(s_t,a_t)\big]-\tau\sum_s\pi(s)D_{KL}\big(\pi(a|s)\,\Vert\,\pi_0(a)\big)\tag{2}\\
\max_\pi J(\pi)&=\sum_{t=1}^T\gamma^{t-1}\mathbb E_{s_t,a_t\sim\pi}\big[r(s_t,a_t)\big]-\tau I_\pi(S;A)\tag{3}\\
\text{where}\quad\pi(s)&=\frac 1Z\sum_{t=1}^T\gamma^{t-1}\pi(s_t=s)\\
&=\frac 1Z\sum_{t=1}^T\gamma^{t-1}\sum_{s_1,a_1,\dots,s_{t-1},a_{t-1}}\pi(s_1)\Big(\prod_{t'=1}^{t-2}\pi(a_{t'}|s_{t'})P(s_{t'+1}|s_{t'},a_{t'})\Big)\pi(a_{t-1}|s_{t-1})P(s|s_{t-1},a_{t-1})
\end{aligned}
$$

Here, $\pi(s)$ is the discounted marginal distribution over states. We slightly abuse the equality sign in Equation (2), since we implicitly introduce the normalization constant $\frac 1Z$ into $\pi(s)$. This, however, does not change the optimization objective: $Z$ is a constant and can easily be absorbed into $\tau$.

From Equation (2), we can see that maximizing $J(\pi_0,\pi)$ w.r.t. $\pi_0$ amounts to minimizing $\sum_s\pi(s)D_{KL}\big(\pi(a|s)\,\Vert\,\pi_0(a)\big)$, which is minimized when $\pi_0(a)=\pi(a)$:

$$
\begin{aligned}
&\sum_s\pi(s)D_{KL}\big(\pi(a|s)\,\Vert\,\pi_0(a)\big)-\sum_s\pi(s)D_{KL}\big(\pi(a|s)\,\Vert\,\pi(a)\big)\\
&=\sum_{s,a}\pi(s,a)\Big(\log\frac{\pi(a|s)}{\pi_0(a)}-\log\frac{\pi(a|s)}{\pi(a)}\Big)\\
&=\sum_{s,a}\pi(s,a)\log\frac{\pi(a)}{\pi_0(a)}\\
&=\sum_a\pi(a)\log\frac{\pi(a)}{\pi_0(a)}\\
&=D_{KL}\big(\pi(a)\,\Vert\,\pi_0(a)\big)\ge 0
\end{aligned}
$$

Therefore, we can derive Equation (3) by substituting $\pi(a)$ for $\pi_0(a)$ in Equation (2).
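As a quick sanity check, the optimality of the marginal can be verified numerically: for a fixed policy, $\pi(a)=\sum_s\pi(s)\pi(a|s)$ attains the smallest average KL to the conditionals among all candidate priors. A minimal sketch with a randomly generated toy policy (all sizes and names are my own):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup: 4 states, 3 actions.
state_dist = rng.dirichlet(np.ones(4))       # pi(s)
policy = rng.dirichlet(np.ones(3), size=4)   # pi(a|s), one row per state

def avg_kl(prior):
    """sum_s pi(s) * KL(pi(a|s) || prior(a))"""
    return float(np.sum(state_dist[:, None] * policy * np.log(policy / prior)))

marginal = state_dist @ policy               # pi(a) = sum_s pi(s) pi(a|s)

# The marginal beats every other candidate prior.
for _ in range(100):
    candidate = rng.dirichlet(np.ones(3))
    assert avg_kl(marginal) <= avg_kl(candidate) + 1e-12
```

Note that $\text{avg\_kl}(\text{marginal})$ is exactly the mutual information $I_\pi(S;A)$ under this state distribution.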

Equation (3) suggests that when the action distribution of the current policy is taken as the action prior, the soft optimality objective penalizes the mutual information between states and actions. Intuitively, this means that we want to discard information in $s$ that is irrelevant to the agent's performance.

With the form of the optimal prior for a fixed policy at hand, one can easily devise a stochastic approximation method, e.g. $\pi_0(a)\leftarrow(1-\alpha)\pi_0(a)+\alpha\pi(a|s)$ with $\alpha\in[0,1]$ and $s\sim\pi(s)$, to estimate the optimal prior $\pi_0(a)$ from the current estimate of the optimal policy $\pi(a|s)$.

Mutual Information Reinforcement Learning Algorithm

Now we apply mutual-information regularization to deep Q-networks (DQN). The resulting algorithm, Mutual Information Reinforcement Learning (MIRL), makes five changes to traditional DQN:

Initial Prior Policy: For an initial fixed policy $\pi(a|s)$, we compute $\pi_0(a)$ by minimizing $\sum_s\pi(s)D_{KL}\big(\pi(a|s)\,\Vert\,\pi_0(a)\big)$.

Prior Update: We approximate the optimal prior by employing the following update equation

$$\pi_0(a)\leftarrow(1-\alpha_{\pi_0})\pi_0(a)+\alpha_{\pi_0}\pi(a|s)$$

where $s\sim\pi(s)$ and $\alpha_{\pi_0}$ is the step size.
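The update is a simple convex combination of two distributions, so the result stays normalized. A minimal sketch (function name and shapes are my own):

```python
import numpy as np

def update_prior(prior, policy_probs, alpha=0.01):
    """One prior-update step: pi_0 <- (1 - alpha) * pi_0 + alpha * pi(.|s).

    prior        -- current estimate pi_0(a), shape (n_actions,)
    policy_probs -- pi(a|s) for a state s sampled from pi(s)
    alpha        -- step size alpha_{pi_0}
    """
    return (1.0 - alpha) * prior + alpha * policy_probs
```

For example, `update_prior(np.array([0.5, 0.5]), np.array([1.0, 0.0]), alpha=0.1)` yields `[0.55, 0.45]`, still a valid distribution.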

Q-function Update: Concurrently with learning the prior, MIRL optimizes the Q-function by minimizing the following loss

$$
L(\theta):=\mathbb E_{s,a,r,s'\sim\text{replay}}\Big[\big(\mathcal T^{\pi_0}_{\text{soft}}\bar Q(s,a,s')-Q_\theta(s,a)\big)^2\Big]\\
\text{where}\quad\big(\mathcal T^{\pi_0}_{\text{soft}}\bar Q\big)(s,a,s'):=r(s,a)+\gamma\tau\log\sum_{a'}\pi_0(a')\exp\big(\bar Q(s',a')/\tau\big)
$$

where $\bar Q$ denotes the target Q-function.
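The regularized backup is a prior-weighted log-sum-exp of the target Q-values. A minimal sketch of one backup (function name mine; the max is subtracted for numerical stability):

```python
import numpy as np

def soft_bellman_target(r, q_next, prior, gamma=0.99, tau=0.1):
    """r + gamma * tau * log sum_a' pi_0(a') exp(Qbar(s', a') / tau).

    q_next -- target-network values Qbar(s', .), shape (n_actions,)
    prior  -- pi_0(a), shape (n_actions,)
    """
    z = q_next / tau
    m = z.max()  # stabilize the log-sum-exp
    return r + gamma * tau * (m + np.log(np.sum(prior * np.exp(z - m))))
```

With a one-hot prior this reduces to an ordinary Bellman backup on that single action, and with a uniform prior it approaches the hard max of DQN as $\tau\to 0$.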

Behavioral Policy: MIRL’s behavioral policy consists of two parts: when exploiting, it takes the greedy action under the soft optimal policy; when exploring, it samples from the optimal prior distribution. Mathematically, given a random sample $u\sim\text{Uniform}[0,1]$ and an exploration threshold $\epsilon$, the action is obtained by

$$
a=\begin{cases}\arg\max_a\pi(a|s)&\text{if }u>\epsilon\\ a\sim\pi_0(\cdot)&\text{if }u\le\epsilon\end{cases}
\qquad\text{where}\quad\pi(a|s)=\frac 1Z\pi_0(a)\exp\big(Q(s,a)/\tau\big)
$$
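In code, the normalizer $Z$ can be dropped since it does not affect the argmax. A possible sketch (names hypothetical):

```python
import numpy as np

def select_action(q_values, prior, tau, epsilon, rng):
    """Exploit the argmax of pi(a|s) ~ pi_0(a) exp(Q(s,a)/tau);
    explore by sampling an action from the prior pi_0."""
    if rng.uniform() > epsilon:
        # log Z is constant over actions, so it is omitted.
        logits = np.log(prior) + q_values / tau
        return int(np.argmax(logits))
    return int(rng.choice(len(prior), p=prior))
```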

$\tau$ Update: the temperature $\tau$ controls the penalty on the mutual information between states and actions. As one might expect, it should be large at first and gradually anneal during training to ensure sufficient initial exploration. MIRL updates the inverse temperature $\beta=\frac 1\tau$ according to the inverse of the empirical loss of the Q-function

$$\beta\leftarrow(1-\alpha_\beta)\beta+\alpha_\beta\frac 1{L(\theta)}$$

where $\alpha_\beta$ is the step size. The intuition is that we want $\tau$ to be large when the loss of the Q-function is large. Note that a large Q-function loss stems from either an inaccurate estimate of $Q$ or large Q-values; both cases call for a large $\tau$. However, given the different scales of $\tau$ and $L(\theta)$, an additional multiplier may be desirable.
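The update itself is one line; a sketch follows (the small constant guarding against division by zero is my own addition):

```python
def update_beta(beta, q_loss, alpha_beta=0.01, eps=1e-8):
    """beta = 1/tau: a large Q-loss drives beta down, i.e. tau up.

    beta   -- current inverse temperature
    q_loss -- empirical Q-function loss L(theta)
    eps    -- guard against division by zero (my addition)
    """
    return (1.0 - alpha_beta) * beta + alpha_beta / (q_loss + eps)
```

Larger losses pull $\beta$ toward smaller values, hence larger $\tau$ and stronger exploration.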

Algorithm

Putting the five updates above together yields the whole algorithm.
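The paper applies these updates to deep Q-networks with a replay buffer; purely as an illustration, here is a tabular sketch on a randomly generated toy MDP that wires the five pieces together (all hyperparameters and the per-sample loss surrogate are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 5, 3
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P(s'|s,a)
R = rng.normal(size=(n_states, n_actions))                        # r(s,a)

gamma, tau, eps = 0.9, 1.0, 0.1
alpha_q, alpha_prior, alpha_beta = 0.1, 0.01, 0.001
Q = np.zeros((n_states, n_actions))
prior = np.full(n_actions, 1.0 / n_actions)  # pi_0(a), initially uniform
beta = 1.0 / tau

s = 0
for step in range(5000):
    # Behavioral policy: exploit the soft-optimal argmax, explore from pi_0.
    if rng.uniform() > eps:
        a = int(np.argmax(np.log(prior) + Q[s] / tau))
    else:
        a = int(rng.choice(n_actions, p=prior))
    s_next = int(rng.choice(n_states, p=P[s, a]))
    r = R[s, a]

    # Soft Bellman target with action prior (stabilized log-sum-exp).
    z = Q[s_next] / tau
    m = z.max()
    target = r + gamma * tau * (m + np.log(np.sum(prior * np.exp(z - m))))
    td_error = target - Q[s, a]
    Q[s, a] += alpha_q * td_error

    # Prior update toward the soft-optimal policy's action distribution.
    logits = np.log(prior) + Q[s] / tau
    pi_s = np.exp(logits - logits.max())
    pi_s /= pi_s.sum()
    prior = (1 - alpha_prior) * prior + alpha_prior * pi_s

    # Temperature update: beta = 1/tau tracks the inverse empirical loss,
    # here approximated by the squared per-step TD error.
    beta = (1 - alpha_beta) * beta + alpha_beta / (td_error**2 + 1e-8)
    tau = 1.0 / beta
    s = s_next
```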

References

Grau-Moya, Jordi, Felix Leibfried, and Peter Vrancx. 2019. “Soft Q-Learning with Mutual-Information Regularization.” In International Conference on Learning Representations (ICLR).